The Construction of a Chinese-English Patent Parallel Corpus
نویسندگان
چکیده
In this paper, we describe the construction of a parallel Chinese-English patent sentence corpus which is created from noisy parallel patents. First, we use a publicly available sentence aligner to find parallel sentence candidates in the noisy parallel data. Then we compare and evaluate three individual measures and different ensemble techniques to sort the parallel sentence candidates according to the confidence score and filter out those with low scores as the noisy data. The experiment shows that the combination of measures outperforms the individual measures, and that filtering out low-quality sentence pairs is readily justified as it can improve SMT performance. Finally, we arrive at the final corpus consisting of 160K sentence pairs in which about 90% are correct or partially correct alignments.
منابع مشابه
Chinese-English Parallel Corpus Construction and its Application
Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, Chinese-English language research and teaching. But so far large-scale Chinese-English corpus is still unavailable yet, given the difficulties and the intensive labours required. In this paper, our work towards building a large-scale Chinese-Engli...
متن کاملCreating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction
This paper first describes an experiment to construct an English-Chinese parallel corpus, then applying the Uplug word alignment tool on the corpus and finally produce and evaluate an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually trans...
متن کاملCOPPA V2.0: Corpus Of Parallel Patent Applications Building Large Parallel Corpora with GNU Make
WIPO seeks to help users and researchers to overcome the language barrier when searching patents published in different languages. Having collected a big multilingual corpus of translated patent applications, WIPO decided to share this corpus in a product called COPPA (Corpus Of Parallel Patent Applications) to stimulate research in Machine Translation and in language tools for patent texts. A ...
متن کاملComparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...
متن کاملA Japanese-English Patent Parallel Corpus
We describe a Japanese-English patent parallel corpus created from the Japanese and US patent data provided for the NTCIR-6 patent retrieval task. The corpus contains about 2 million sentence pairs that were aligned automatically. This is the largest Japanese-English parallel corpus, which will be available to the public after the 7th NTCIR workshop meeting. We estimated that about 97% of the s...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009